Similarity Based Chinese Synonym Collocation Extraction

نویسندگان

  • Wanyin Li
  • Qin Lu
  • Ruifeng Xu
چکیده

Collocation extraction systems based on pure statistical methods suffer from two major problems. The first problem is their relatively low precision and recall rates. The second problem is their difficulty in dealing with sparse collocations. In order to improve performance, both statistical and lexicographic approaches should be considered. This paper presents a new method to extract synonymous collocations using semantic information. The semantic information is obtained by calculating similarities from HowNet. We have successfully extracted synonymous collocations which normally cannot be extracted using lexical statistics. Our evaluation conducted on a 60MB tagged corpus shows that we can extract synonymous collocations that occur with very low frequency and that the improvement in the recall rate is close to 100%. In addition, compared with a collocation extraction system based on the Xtract system for English, our algorithm can improve the precision rate by about 44%.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Using Synonym Relations in Chinese Collocation Extraction

A challenging task in Chinese collocation extraction is to improve both the precision and recall rate. Most lexical statistical methods including Xtract face the problem of unable to extract collocations with lower frequencies than a given threshold. This paper presents a method where HowNet is used to find synonyms using a similarity function. Based on such synonym information, we have success...

متن کامل

A Hybrid Extraction Model for Chinese Noun/Verb Synonym bi-gram Collocations

Statistical-based collocation extraction approaches suffer from (1) low precision rate because high co-occurrence bi-grams may be syntactically unrelated and are thus not true collocations; (2) low recall rate because some true collocations with low occurrences cannot be identified successfully by statistical-based models. To integrate both syntactic rules as well as semantic knowledge into a s...

متن کامل

Chinese Entity Relation Extraction Based on Word Co-occurrence

Chinese entity relation extraction is a part of entity relation extraction. According to entity relation extraction technology and the features of Chinese news corpus, this paper proposes a novel method for Chinese entities relation extraction. The method, named WCORE (word co-occurrence relation extraction), first measures the semantic similarity by word co-occurrence and then adopts pattern m...

متن کامل

Chinese Sketch Engine and the Extraction of Grammatical Collocations

This paper introduces a new technology for collocation extraction in Chinese. Sketch Engine (Kilgarriff et al., 2004) has proven to be a very effective tool for automatic description of lexical information, including collocation extraction, based on large-scale corpus. The original work of Sketch Engine was based on BNC. We extend Sketch Engine to Chinese based on Gigaword corpus from LDC. We d...

متن کامل

Chinese-English Bilingual Word Semantic Similarity Based on Chinese WordNet

Semantic similarity measurement of multilingual words is a challenging problem in data mining, information extraction, information retrieval, etc. This paper introduces an algorithm to measure the semantic similarity of Chinese-English bilingual words based on Chinese WordNet, an expansion of WordNet in Simplified Chinese. The algorithm not only measures the semantic similarity for Chinese and ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • IJCLCLP

دوره 10  شماره 

صفحات  -

تاریخ انتشار 2005